Groups 1 & 2

Contradiction Check

There are 432 women with Y = 1.

All Y = 1 cases have at least one active match_* flag, indicating consistency between labels and diagnosis sources.

Among positive cases:

346 women have exactly one source.

72 have two sources.

14 have three sources.

All Y = 0 cases have no active match_* flags.

The match_measure_after column belongs to excluded rows, so it's always missing and was removed.

match_rasham_after is not informative on its own, as it only appears alongside match_aspirin_after or match_pdf_after.

There are no missing values in Group 1 columns.

Some women have multiple active flags, but most have only one.
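The consistency check above can be sketched as follows. This is a minimal illustration on a toy frame; the `match_*` column names are assumed from the report, not taken from the actual data:

```python
import pandas as pd

# Toy frame: label Y plus match_* diagnosis-source flags (names assumed).
df = pd.DataFrame({
    "Y": [1, 1, 0, 1, 0],
    "match_aspirin_after": [1, 0, 0, 1, 0],
    "match_pdf_after":     [0, 1, 0, 1, 0],
    "match_rasham_after":  [0, 0, 0, 1, 0],
})

match_cols = [c for c in df.columns if c.startswith("match_")]
n_sources = df[match_cols].sum(axis=1)

# Contradiction checks: every Y=1 row has >=1 source, every Y=0 row has none.
assert (n_sources[df["Y"] == 1] >= 1).all()
assert (n_sources[df["Y"] == 0] == 0).all()

# Distribution of source counts among positives
# (346 / 72 / 14 for one/two/three sources in the real data).
source_counts = n_sources[df["Y"] == 1].value_counts().sort_index()
print(source_counts)
```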

Group 2 - Diagnostic Subtype Counts

Empty columns here indicate excluded women and were removed from further processing.

Most women had no diagnosis or a single one; only a few had multiple diagnoses.

Combination Table (count)

Most women (9,713 out of 10,000) did not receive any diagnosis (all Group 2 variables = 0).

The most common scenarios among Y=1 cases are:

Only preeclampsia_sum = 1 (65 women)

Only pregnancy_hypertension_sum = 1 (53 women)

Only essential_hypertension_sum = 1 (45 women)

A small minority have multiple diagnoses. labs_sum appears alone in 35 women - I assumed it reflects monitoring, not high clinical severity.
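A combination table like the one described can be built directly with pandas. A small sketch on toy data, with the Group-2 column names taken from the counts above:

```python
import pandas as pd

# Toy Group-2 subtype flags (real data has 10,000 rows).
g2 = pd.DataFrame({
    "preeclampsia_sum":           [1, 0, 0, 0, 0],
    "pregnancy_hypertension_sum": [0, 1, 0, 1, 0],
    "essential_hypertension_sum": [0, 0, 0, 1, 0],
    "labs_sum":                   [0, 0, 0, 0, 0],
})

# Count each unique combination of subtype flags.
combo_counts = g2.value_counts().reset_index(name="count")
print(combo_counts)
```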

Group 3 - Features by Group

Drop constant
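A minimal sketch of the constant-column drop (column names here are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 31, 29],
    "const_flag": [0, 0, 0],                         # constant -> dropped
    "init_date": ["2020-01", "2020-01", "2020-01"],  # constant -> dropped
})

# Drop columns with a single unique value (NaN-only columns count as constant).
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)
print(constant_cols)  # ['const_flag', 'init_date']
```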

Handle features by type (prefix)

init_date

The variable likely represents an internal feature, but is non-informative as it only encodes a date.

Demog

Missing values in demographic columns were retained due to their negligible impact (only 9 rows).

Smoking

Smoking Analysis

Rows with missing 'smoking_is_smoker' are younger, have lower capitation, and show similar Y distribution.

Suggests MAR: missingness likely related to observed variables (age, capitation).
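This MAR check amounts to comparing observed covariates between rows with and without the missing value. A sketch on simulated data, where the missingness mechanism is deliberately tied to age (all names and values are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(30, 5, n).round(1),
    "capitation": rng.normal(2.0, 0.5, n).round(2),
    "Y": rng.integers(0, 2, n),
})
# Simulated MAR mechanism: younger women are more likely to be missing.
df["smoking_is_smoker"] = np.where(df["age"] < 28, np.nan,
                                   rng.integers(0, 2, n))

# Compare observed variables between missing and non-missing rows.
missing = df["smoking_is_smoker"].isna()
comparison = df.groupby(missing)[["age", "capitation", "Y"]].mean()
print(comparison)  # the missing group should show a lower mean age
```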

Handle Smoking Anomalies

Smoking years above 40 are unlikely for pregnant women, as most are under 50 and unlikely to have started smoking before age 10. Values above 45 are considered extreme outliers.
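The outlier rule above can be applied without imputation by masking extreme values to NaN (tree models downstream handle NaN natively). A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

smoking_years = pd.Series([3, 12, 47, 20, 55, np.nan])

# Values above 45 are treated as extreme outliers and set to NaN;
# no imputation is performed.
cleaned = smoking_years.mask(smoking_years > 45)
print(cleaned.tolist())
```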

Smoking summary

In the current dataset, the smoking-related columns (smoking_is_smoker, smoking_smoking_years, smoking_total_heavy_smokers) suffer from extensive missing values and inconsistencies (such as unrealistic values or unclear categorical encoding), significantly reducing their predictive utility.

No missing indicators were added, as no imputation was performed and the model can handle NaNs. 397 rows with missing smoking_smoking_years were retained due to valuable information in other features (e.g., age, capitation). Missingness in smoking_is_smoker appears MAR and has no significant impact on Y, so these rows were kept.

Lab

I presented the distribution plots and boxplots not to inspect each individual feature, but rather to ensure that the overall scale and value ranges made sense visually and were reasonably coherent across features.

Outlier ranges were reviewed but not removed, as values are within a similar order of magnitude and appear consistent.

Several lab features show very high correlations (e.g., MoM vs. absolute values, CBC parameters, white blood cell subtypes).

I did not manually remove any of them at this stage; instead, I rely on the Elastic Net model to handle redundant features through regularization.
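Flagging the highly correlated pairs (without dropping them) can be sketched as below; the lab feature names are invented for illustration, and the MoM column is simulated as a scaled copy of its absolute counterpart:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=300)
labs = pd.DataFrame({
    "papp_a": base,                                            # names assumed
    "papp_a_mom": base * 1.8 + rng.normal(scale=0.05, size=300),
    "wbc": rng.normal(size=300),
})

# List feature pairs with |r| > 0.9; they are left in place for the
# Elastic Net to prune via regularization.
corr = labs.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # upper triangle only
pairs = corr.where(mask).stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```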

measure

Missing values were not imputed, as tree-based models (e.g., LGBM) handle NaNs natively.
Imputing derived features may introduce noise without clear benefit.
Leaving them missing allows the model to learn whether the missingness itself is informative.
Here, missingness in blood pressure variability shows only a slight difference in Y distribution, suggesting the effect is minor and likely not MNAR.


Multiple systolic and diastolic blood pressure features are highly correlated (r > 0.9), especially between min, max, mean, first, and last values.
This indicates significant redundancy, likely due to being derived from the same set of measurements.
Feature selection or regularized models will be used to reduce multicollinearity.

Groups 4 & 24

num_of_diag_cols

The value 112 in 24_diag_80_num_of_diag was deemed implausible based on clinical notes, which show no indication of repeated or complex diagnoses. It was treated as a data error and replaced with NaN.
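Replacing a single implausible entry with NaN, rather than dropping the row, is a one-liner in pandas. A minimal sketch (the surrounding values are made up):

```python
import numpy as np
import pandas as pd

col = "24_diag_80_num_of_diag"
df = pd.DataFrame({col: [0, 2, 112, 1, 3]})

# 112 was judged implausible given the clinical notes; treat it as a
# data-entry error and replace with NaN, keeping the rest of the row.
df[col] = df[col].replace(112, np.nan)
print(df[col].max())  # 3.0
```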

days_since_diag_cols

Many *_days_since_last_diag features exhibit strong pairwise correlations (r > 0.8), indicating redundancy.
This suggests multicollinearity and justifies feature selection or the use of regularized models to reduce the effective number of features.

EDA

Most women with gestational hypertension do not have chronic hypertension (68.5%).

About 31.5% have both conditions – indicating partial overlap.

Among women with chronic hypertension, 35% also experience gestational hypertension – possibly suggesting a risk factor.

There is a connection between the two conditions, but they are not fully overlapping.
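The overlap figures above come from a simple cross-tabulation. A sketch on toy flags (column names are illustrative, and the toy share is not the reported 31.5%):

```python
import pandas as pd

df = pd.DataFrame({
    "gestational_htn": [1, 1, 1, 0, 1, 0],
    "chronic_htn":     [0, 1, 0, 1, 1, 0],
})

# Cross-tabulate the two conditions to measure their overlap.
ct = pd.crosstab(df["gestational_htn"], df["chronic_htn"])
print(ct)

# Share of gestational-hypertension women who also have chronic hypertension.
share = ct.loc[1, 1] / ct.loc[1].sum()
print(round(share, 3))
```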

The severity_level variable was created for exploratory analysis, assigning each patient a single severity label based on priority.
This simplifies the view for visualization and summary, without affecting the original detailed features.
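A priority-based label like severity_level is naturally expressed with np.select, where the first matching condition wins. A sketch under the assumption that severity descends from Eclampsia to Labs-only (the exact priority order in the project may differ):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "eclampsia_sum":              [0, 1, 0, 0],
    "preeclampsia_sum":           [1, 1, 0, 0],
    "pregnancy_hypertension_sum": [0, 0, 1, 0],
    "essential_hypertension_sum": [0, 0, 0, 0],
    "labs_sum":                   [0, 0, 1, 0],
})

# Assign one label per patient by clinical priority (most severe first);
# np.select takes the first condition that holds.
conditions = [
    df["eclampsia_sum"] > 0,
    df["preeclampsia_sum"] > 0,
    df["pregnancy_hypertension_sum"] > 0,
    df["essential_hypertension_sum"] > 0,
    df["labs_sum"] > 0,
]
labels = ["Eclampsia", "Preeclampsia", "Pregnancy Hypertension",
          "Essential Hypertension", "Labs only"]
df["severity_level"] = np.select(conditions, labels, default="None")
print(df["severity_level"].tolist())
```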

Severity Level Distribution

Among patients diagnosed with any form of hypertension during pregnancy, Preeclampsia is the most prevalent subtype, followed by Essential and Pregnancy Hypertension. Lab-only diagnoses are less frequent, and Eclampsia, the most severe condition, is rare.

Demography distribution

Demography distribution by Y (Boxplot)

Women with Y=1 are slightly older on average than those with Y=0, with overlapping age ranges, suggesting a mild age-related risk.

Demography distribution by Severity Level (Boxplot)

Smoking status is fairly similar between groups, with no clear or significant link to Y or diagnosis type.

fmt=".1f" displays numbers as 120.5 instead of 1.2e+02.
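This is Python's format-spec mini-language, which seaborn passes through for heatmap annotations; the behavior can be checked standalone:

```python
# ".1f" renders fixed-point with one decimal; ".1e" gives scientific notation.
print(format(120.456, ".1f"))  # 120.5
print(format(120.456, ".1e"))  # 1.2e+02
```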

Average blood pressure (systolic and diastolic) is higher among women with Y=1 and in groups with greater clinical severity — especially Eclampsia, Preeclampsia, and Pregnancy Hypertension.

Exploratory Data Analysis of Predictive Features Based on Prior Literature Review

NLP - Clinical Sheet

Train - Test split

Using stratified sampling ensures that each diagnosis source maintains similar distributions in both the training and test sets. This prevents imbalance and ensures the model generalizes effectively, accurately reflecting real-world performance on unseen data.

Rare stratification groups were consolidated under a common label ('rare') to ensure robust and valid stratified splitting.
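The consolidation-then-split step can be sketched as below; the source labels, the minimum group size of 10, and the 25% test fraction are illustrative assumptions, not the project's actual settings:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "source": ["aspirin"] * 40 + ["pdf"] * 40 + ["rasham"] * 2,
    "Y":      [0, 1] * 41,
})

# Collapse strata with too few members into a shared 'rare' label so
# stratified splitting does not fail on near-singleton groups.
counts = df["source"].value_counts()
df["stratum"] = df["source"].where(df["source"].map(counts) >= 10, "rare")

train, test = train_test_split(df, test_size=0.25, random_state=42,
                               stratify=df["stratum"])
print(train["stratum"].value_counts(normalize=True).round(2))
```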

Embeddings

This ensures there is no overlap between Train and Test texts, preventing data leakage from shared embeddings.

TF-IDF

Mutual Information was used because it captures non-linear dependencies and makes no assumptions about the distribution, effectively identifying informative TF-IDF features for the binary target (Y).

Identifies top words most associated with each class (Y=0 / Y=1) based on TF-IDF and Mutual Information.

I chose a Mutual Information threshold of 0.005 as values above this point indicate meaningful predictive information, reducing potential noise. Additionally, I assumed that subsequent regularization (Elastic Net/LGBM) would further exclude less predictive features.
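The MI-based filtering can be sketched with scikit-learn's mutual_info_classif; the data here is simulated (one informative column, one noise column), so only the threshold logic, not the actual feature set, matches the project:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
# Simulated TF-IDF-like matrix: column 0 depends on Y, column 1 is noise.
X = np.column_stack([
    y * 0.5 + rng.normal(scale=0.3, size=500),
    rng.normal(size=500),
])

# Keep features whose estimated MI with Y exceeds the 0.005 threshold.
mi = mutual_info_classif(X, y, random_state=0)
selected = np.where(mi > 0.005)[0]
print(mi.round(3), selected)
```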

Add selected words as features

Predictive word features were extracted using TF-IDF and Mutual Information from the train set only, and applied to the test set without any label exposure or retraining.

Regularization & Standardization

I divided the features into two groups for appropriate scaling and regularization:

Clinical data: Scaled and processed using ElasticNet on the training set to identify key features.

Words & Embeddings: No scaling needed (already on a uniform scale), processed separately via LASSO on the training set to eliminate irrelevant features.

All feature selection steps were performed exclusively on the training data to prevent information leakage. The selected features were then combined for the final modeling phase.
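Train-only selection with an elastic-net penalty can be sketched as below. This uses an elastic-net-penalized logistic regression as a stand-in for the project's ElasticNet step, on simulated data where only feature 0 carries signal:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 5))
y_train = (X_train[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

# Fit the scaler and the penalized model on TRAIN ONLY; the fitted scaler
# would later be applied to the test set via transform(), never refit.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, C=0.1, max_iter=5000)
model.fit(scaler.transform(X_train), y_train)

# Keep features whose coefficient survives the L1 component.
selected = np.flatnonzero(model.coef_[0])
print(selected)  # feature 0 should survive
```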

ElasticNet

Lasso

Data-Leak check

Model

Motivation for using LightGBM:

Excellent handling of imbalanced datasets by weighting minority class (patients with hypertension).

Built-in capability to manage missing values without explicit imputation.

High computational efficiency, suitable for large and complex datasets.

Effective internal regularization and feature selection, addressing multicollinearity.

Facilitates interpretability through SHAP analysis, aiding clinical insights.
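The motivations above map onto concrete LightGBM options. The parameter names below are real LightGBM settings, but the specific values are illustrative, not the project's tuned configuration:

```python
# Illustrative LightGBM configuration reflecting the points above.
lgbm_params = {
    "objective": "binary",
    "is_unbalance": True,    # reweight the minority (hypertension) class
    "use_missing": True,     # native NaN handling, no explicit imputation
    "lambda_l1": 0.1,        # regularization to damp redundant features
    "lambda_l2": 0.1,
    "n_estimators": 500,
    "learning_rate": 0.05,
}
print(sorted(lgbm_params))
```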

Evaluation

Top 1%

Error analysis

Feature Importance